Iris data


High-Dimensional Data Classification in Concentric Coordinates

Williams, Alice, Kovalerchuk, Boris

arXiv.org Artificial Intelligence

Alice Williams, Department of Computer Science, Central Washington University, USA (ORCID 0009-0001-6154-2407); Boris Kovalerchuk, Department of Computer Science, Central Washington University, USA (ORCID 0000-0002-0995-9539)

Abstract -- The visualization of multidimensional data with interpretable methods remains limited by the lack of high-dimensional lossless visualizations that avoid occlusion and remain computationally tractable through parameterized visualization. This paper proposes a framework supporting low- to high-dimensional data using lossless Concentric Coordinates, a more compact generalization of Parallel Coordinates, along with the earlier Circular Coordinates. Both are forms of General Line Coordinates visualizations that can directly support machine learning algorithm visualization and facilitate human interaction.

A. Motivation. In some domains, accurate and interpretable classification models can already be visualized accurately. In many other domains, however, this remains a long-standing and critical roadblock to deploying artificial intelligence and machine learning (AI/ML) models, and it is especially challenging for high-risk tasks such as healthcare diagnostics. Visualization of multidimensional (n-D) data classification is critical for three major reasons: (1) to speed up analysis of prediction accuracy, (2) to interpret/explain classifier predictions, and (3) to improve/modify the prediction model.

B. Overview of Existing Methods. AI/ML tasks on high-dimensional (n-D) data are commonly approached with black-box deep learning (DL) methods that inherently lack interpretability and decision explanation, relying instead on post hoc explainability methods such as LIME or SHAP [7]. Moreover, visualization methods commonly preprocess data with dimensionality reduction (DR) methods such as Principal Component Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE), or similar approximations. Such methods are lossy and not reversible, and therefore commonly introduce inaccuracies in n-D that cannot be visually verified. Alternatively, lossless visualizations allow the use of Visual Knowledge Discovery (VKD) to visually discover algorithmic adjustments that improve ML prediction models [5].
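As a rough illustration of the lossless layouts discussed above, the sketch below draws the iris data both in standard Parallel Coordinates and in a simple concentric rendering in which each attribute occupies its own circle and the normalized value sets the angular position. The concentric mapping here is an assumption for illustration only and is not necessarily the exact Concentric Coordinates construction of the paper.

```python
# Minimal sketch (not the paper's implementation): Parallel Coordinates vs.
# one plausible concentric rendering of the same lossless polylines.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
# Min-max normalize each attribute to [0, 1] so all axes share a scale.
Xn = (X - X.min(axis=0)) / np.ptp(X, axis=0)

fig, (ax_pc, ax_cc) = plt.subplots(1, 2, figsize=(11, 5))

# Parallel Coordinates: one vertical axis per attribute, one polyline per sample.
for xi, label in zip(Xn, y):
    ax_pc.plot(range(X.shape[1]), xi, color=plt.cm.tab10(label), alpha=0.3)
ax_pc.set_xticks(range(X.shape[1]))
ax_pc.set_xticklabels(iris.feature_names, rotation=20)
ax_pc.set_title("Parallel Coordinates (iris)")

# Concentric rendering (illustrative): attribute i lives on a circle of
# radius i + 1; the value picks the angle; samples remain lossless polylines.
radii = np.arange(1, X.shape[1] + 1)
for xi, label in zip(Xn, y):
    theta = 2 * np.pi * xi                  # value -> angle on each circle
    px, py = radii * np.cos(theta), radii * np.sin(theta)
    ax_cc.plot(px, py, color=plt.cm.tab10(label), alpha=0.3)
for r in radii:                             # draw the attribute circles
    ax_cc.add_patch(plt.Circle((0, 0), r, fill=False, color="gray", lw=0.5))
ax_cc.set_aspect("equal")
ax_cc.set_title("Concentric rendering (illustrative)")
plt.tight_layout()
plt.show()
```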


Extending Explainable Ensemble Trees (E2Tree) to regression contexts

Aria, Massimo, Gnasso, Agostino, Iorio, Carmela, Fokkema, Marjolein

arXiv.org Machine Learning

Ensemble methods such as random forests have transformed the landscape of supervised learning, offering highly accurate prediction through the aggregation of multiple weak learners. However, despite their effectiveness, these methods often lack transparency, impeding users' comprehension of how random forest (RF) models arrive at their predictions. Explainable ensemble trees (E2Tree) is a novel methodology for explaining random forests that provides a graphical representation of the relationship between response variables and predictors. A striking characteristic of E2Tree is that it not only accounts for the effects of predictor variables on the response but also accounts for associations between the predictor variables through the computation and use of dissimilarity measures. The E2Tree methodology was initially proposed for use in classification tasks. In this paper, we extend the methodology to encompass regression contexts. To demonstrate the explanatory power of the proposed algorithm, we illustrate its use on real-world datasets.
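E2Tree's own dissimilarity computation is not reproduced here, but a closely related and widely used ingredient is the random-forest proximity between observations: how often two samples end up in the same terminal node across the forest. The sketch below computes such a proximity-based dissimilarity matrix for a fitted random forest regressor; it is an analogue for illustration, not the E2Tree measure itself.

```python
# Random-forest dissimilarity between observations: the fraction of trees
# in which two samples do NOT share a terminal node (classic RF proximity),
# shown only as a rough analogue of the dissimilarity idea used by E2Tree.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# leaves[i, t] = index of the terminal node that sample i reaches in tree t.
leaves = rf.apply(X)                                      # (n_samples, n_trees)
same_leaf = leaves[:, None, :] == leaves[None, :, :]      # pairwise, per tree
proximity = same_leaf.mean(axis=2)                        # share of trees with a common leaf
dissimilarity = 1.0 - proximity

print(dissimilarity.shape)                                # (442, 442)
print(dissimilarity[0, :5].round(3))
```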


Synthetic Data Generation and Automated Multidimensional Data Labeling for AI/ML in General and Circular Coordinates

Williams, Alice, Kovalerchuk, Boris

arXiv.org Artificial Intelligence

An insufficient amount of available training data is a critical challenge for both development and deployment of artificial intelligence and machine learning (AI/ML) models. This paper proposes a unified approach to both synthetic data generation (SDG) and automated data labeling (ADL) with a single SDG-ADL algorithm. SDG-ADL uses multidimensional (n-D) representations of data visualized losslessly with General Line Coordinates (GLCs), relying on reversible GLC properties to visualize n-D data in multiple GLCs. This paper demonstrates the use of the new Circular Coordinates in static and dynamic forms, together with Parallel Coordinates and Shifted Paired Coordinates, since each GLC exemplifies unique data properties, such as interattribute n-D distributions and outlier detection. The approach is interactively implemented in computer software with the Dynamic Coordinates Visualization system (DCVis). Results with real data are demonstrated in case studies, evaluating the impact on classifiers.
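The SDG-ADL algorithm itself relies on interactive GLC visualization and is not reproduced here; the sketch below only illustrates, in plain Python, the two ingredients the abstract combines: generating synthetic samples by jittering real samples within each class, and automatically labeling new points by the nearest class centroid. Both helper functions are hypothetical simplifications for illustration.

```python
# Generic illustration (not the paper's SDG-ADL algorithm) of synthetic data
# generation by per-class jitter plus automated labeling by nearest centroid.
import numpy as np
from sklearn.datasets import load_iris

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)

def synthesize(X, y, n_per_class=50, scale=0.05):
    """Generate synthetic samples by Gaussian jitter around real samples, per class."""
    parts_X, parts_y = [], []
    for c in np.unique(y):
        Xc = X[y == c]
        picks = Xc[rng.integers(0, len(Xc), n_per_class)]
        noise = rng.normal(0.0, scale * Xc.std(axis=0), picks.shape)
        parts_X.append(picks + noise)
        parts_y.append(np.full(n_per_class, c))
    return np.vstack(parts_X), np.concatenate(parts_y)

def auto_label(X_new, X_ref, y_ref):
    """Label new points with the class of the nearest class centroid."""
    classes = np.unique(y_ref)
    centroids = np.stack([X_ref[y_ref == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(X_new[:, None, :] - centroids[None, :, :], axis=2)
    return classes[d.argmin(axis=1)]

X_syn, y_syn = synthesize(X, y)
y_pred = auto_label(X_syn, X, y)
print("agreement with generating class:", (y_pred == y_syn).mean())
```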


General Line Coordinates in 3D

Martinez, Joshua, Kovalerchuk, Boris

arXiv.org Artificial Intelligence

Interpretable interactive visual pattern discovery in lossless 3D visualization is a promising way to advance machine learning. It enables end users who are not data scientists to take control of the model development process as a self-service. It is conducted in 3D General Line Coordinates (GLC) visualization space, which preserves all n-D information in 3D. This paper presents a system that combines three types of GLC: Shifted Paired Coordinates (SPC), Shifted Tripled Coordinates (STC), and General Line Coordinates-Linear (GLC-L) for interactive visual pattern discovery. The transition from 2-D to 3-D visualization yields more distinct visual patterns and allows finding the best data viewing positions, neither of which is available in 2-D. It enables in-depth visual analysis of various class-specific data subsets comprehensible to end users in the original interpretable attributes. Controlling model overgeneralization by end users is an additional benefit of this approach.
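A minimal 2-D version of one of the GLC types named above, Shifted Paired Coordinates, can be sketched as follows: each 4-D iris sample is split into two attribute pairs, each pair is plotted in its own shifted Cartesian plane, and the sample becomes a polyline joining the two planes. The paper's 3-D system is not reproduced; this shows only the basic SPC idea.

```python
# Minimal 2-D sketch of Shifted Paired Coordinates (SPC) on the iris data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
Xn = (X - X.min(axis=0)) / np.ptp(X, axis=0)      # normalize to [0, 1]

pairs = [(0, 1), (2, 3)]                          # (sepal len, wid), (petal len, wid)
shift = 1.5                                       # horizontal offset between planes

fig, ax = plt.subplots(figsize=(7, 4))
for xi, label in zip(Xn, y):
    xs = [xi[a] + k * shift for k, (a, _) in enumerate(pairs)]
    ys = [xi[b] for (_, b) in pairs]
    ax.plot(xs, ys, marker="o", ms=2, color=plt.cm.tab10(label), alpha=0.3)

# Draw the unit square marking each shifted plane.
for k in range(len(pairs)):
    ax.add_patch(plt.Rectangle((k * shift, 0), 1, 1, fill=False, color="gray"))
ax.set_aspect("equal")
ax.set_title("Shifted Paired Coordinates (2-D sketch, iris)")
plt.show()
```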


Criticality Analysis: Bio-inspired Nonlinear Data Representation

olde Scheper, Tjeerd V.

arXiv.org Artificial Intelligence

The representation of arbitrary data in a biological system is one of the most elusive elements of biological information processing. The often logarithmic nature of information in amplitude and frequency presented to biosystems prevents simple encapsulation of the information contained in the input. Criticality Analysis (CA) is a bio-inspired method of information representation within a controlled self-organised critical system that allows scale-free representation. This is based on the concept of a reservoir of dynamic behaviour in which self-similar data will create dynamic nonlinear representations. This unique projection of data preserves the similarity of data within a multidimensional neighbourhood. The input can be reduced dimensionally to a projection output that retains the features of the overall data, yet has a much simpler dynamic response. The method depends only on the rate control of chaos applied to the underlying controlled models, which allows the encoding of arbitrary data and promises optimal encoding of data given biologically relevant networks of oscillators. The CA method allows for a biologically relevant encoding mechanism of arbitrary input to biosystems, creating a suitable model for information processing in organisms of varying complexity and scale-free data representation for machine learning.
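Reproducing rate-controlled chaotic oscillators faithfully is beyond a short sketch, so the following is only a loose, generic stand-in for the "reservoir of dynamic behaviour" idea: a tiny echo-state-style random reservoir that projects an input signal into the transient dynamics of a fixed nonlinear system and reads out a low-dimensional representation. It is explicitly not the Criticality Analysis method.

```python
# Generic echo-state-style reservoir (a stand-in, not Criticality Analysis):
# drive a fixed random nonlinear system with an input and read out 3 channels.
import numpy as np

rng = np.random.default_rng(1)
n_res, n_in, n_out = 100, 1, 3

W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.normal(0, 1, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # keep spectral radius below 1
W_out = rng.normal(0, 1, (n_out, n_res))          # fixed random readout

def project(u):
    """Run the input sequence through the reservoir; return the readout trace."""
    x = np.zeros(n_res)
    trace = []
    for u_t in u:
        x = np.tanh(W_in @ np.atleast_1d(u_t) + W @ x)
        trace.append(W_out @ x)
    return np.array(trace)

signal = np.sin(np.linspace(0, 8 * np.pi, 400)) + 0.1 * rng.normal(size=400)
out = project(signal)
print(out.shape)    # (400, 3): a 3-D dynamic representation of the 1-D input
```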


Clustering -- Basic concepts and methods

Kapp-Joswig, Jan-Oliver Felix, Keller, Bettina G.

arXiv.org Artificial Intelligence

We review clustering as an analysis tool and the underlying concepts from an introductory perspective. What is clustering and how can clusterings be realised programmatically? How can data be represented and prepared for a clustering task? And how can clustering results be validated? Connectivity-based versus prototype-based approaches are reflected in the context of several popular methods: single-linkage, spectral embedding, k-means, and Gaussian mixtures are discussed as well as the density-based protocols (H)DBSCAN, Jarvis-Patrick, CommonNN, and density-peaks.
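As a companion to the overview, the short sketch below runs several of the reviewed method families on the same standardized data via scikit-learn: prototype-based k-means and Gaussian mixtures, connectivity-based single linkage, and density-based DBSCAN, compared against the known species labels with the adjusted Rand index.

```python
# Compare a few clustering families from the review on the standardized iris data.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

X, y = load_iris(return_X_y=True)
Xs = StandardScaler().fit_transform(X)            # standardize before clustering

methods = {
    "k-means":          KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xs),
    "Gaussian mixture": GaussianMixture(n_components=3, random_state=0).fit_predict(Xs),
    "single linkage":   AgglomerativeClustering(n_clusters=3, linkage="single").fit_predict(Xs),
    "DBSCAN":           DBSCAN(eps=0.8, min_samples=5).fit_predict(Xs),
}
for name, labels in methods.items():
    print(f"{name:16s} ARI vs. species: {adjusted_rand_score(y, labels):.2f}")
```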


Shape complexity in cluster analysis

Aguilar, Eduardo J., Barbosa, Valmir C.

arXiv.org Artificial Intelligence

In cluster analysis, a common first step is to scale the data aiming to better partition them into clusters. Even though many different techniques have throughout many years been introduced to this end, it is probably fair to say that the workhorse in this preprocessing phase has been to divide the data by the standard deviation along each dimension. Like division by the standard deviation, the great majority of scaling techniques can be said to have roots in some sort of statistical take on the data. Here we explore the use of multidimensional shapes of data, aiming to obtain scaling factors for use prior to clustering by some method, like k-means, that makes explicit use of distances between samples. We borrow from the field of cosmology and related areas the recently introduced notion of shape complexity, which in the variant we use is a relatively simple, data-dependent nonlinear function that we show can be used to help with the determination of appropriate scaling factors. Focusing on what might be called "midrange" distances, we formulate a constrained nonlinear programming problem and use it to produce candidate scaling-factor sets that can be sifted on the basis of further considerations of the data, say via expert knowledge. We give results on some iconic data sets, highlighting the strengths and potential weaknesses of the new approach. These results are generally positive across all the data sets used.
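The shape-complexity optimization itself is not reproduced below; the sketch only shows where per-dimension scaling factors enter such a pipeline: divide each dimension by its factor, then cluster with k-means. The candidate factor set in the example is a hypothetical placeholder standing in for the output of the constrained optimization.

```python
# Where per-dimension scaling factors plug in before k-means; the candidate
# factors below are a hypothetical placeholder, not the paper's optimized set.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, y = load_iris(return_X_y=True)

def cluster_with_factors(X, factors, k=3):
    """Scale each dimension by its factor, cluster, and score against known labels."""
    Xs = X / factors
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Xs)
    return adjusted_rand_score(y, labels)

std_factors = X.std(axis=0)                           # the classical baseline scaling
candidate_factors = np.array([1.0, 2.0, 0.5, 0.5])    # placeholder candidate set

print("std-dev scaling   ARI:", round(cluster_with_factors(X, std_factors), 2))
print("candidate scaling ARI:", round(cluster_with_factors(X, candidate_factors), 2))
```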


Using SingleStoreDB, MindsDB, and Deepnote - DZone Big Data

#artificialintelligence

This article will show how to use SingleStoreDB with MindsDB using Deepnote. We'll create integrations within Deepnote, load the Iris flower data set into SingleStoreDB, and then use MindsDB to create a Machine Learning (ML) model from the Iris data stored in SingleStoreDB. We'll also make some example predictions using the ML model. Most of the code will be in SQL, enabling developers with solid SQL skills to hit the ground running and start working with ML immediately. The notebook file used in this article is available on GitHub.


Clustering performance analysis using new correlation based cluster validity indices

Wiroonsri, Nathakhun

arXiv.org Machine Learning

There are various cluster validity measures used for evaluating clustering results. One of the main objectives of using these measures is to seek the optimal unknown number of clusters. Some measures work well for clusters with different densities, sizes and shapes. Yet, one weakness that those validity measures share is that they sometimes provide only one clear optimal number of clusters. That number is actually unknown, and there might be more than one potential sub-optimal option that a user may wish to choose based on different applications. We develop two new cluster validity indices based on a correlation between the actual distance between a pair of data points and the distance between the centroids of the clusters in which the two points are located. Our proposed indices consistently yield several peaks at different numbers of clusters, which overcomes the weakness stated above. Furthermore, the introduced correlation can also be used for evaluating the quality of a selected clustering result. Several experiments in different scenarios, including the well-known iris data set and a real-world marketing application, have been conducted in order to compare the proposed validity indices with several well-known ones.
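The core quantity the abstract describes can be sketched directly: for every pair of points, take the actual distance between them and the distance between the centroids of the clusters they belong to, then correlate the two across all pairs. The indices proposed in the paper add further ingredients not shown in this simplified version.

```python
# Correlation between pairwise point distances and the distances between
# the centroids of the clusters those points belong to (simplified sketch).
import numpy as np
from scipy.stats import pearsonr
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)

def centroid_correlation(X, labels):
    centroids = np.stack([X[labels == c].mean(axis=0) for c in np.unique(labels)])
    point_d = squareform(pdist(X))                        # actual pairwise distances
    cent_d = squareform(pdist(centroids))                 # centroid-to-centroid distances
    pair_cent_d = cent_d[labels[:, None], labels[None, :]]  # centroid distance per point pair
    iu = np.triu_indices(len(X), k=1)                     # each unordered pair once
    return pearsonr(point_d[iu], pair_cent_d[iu])[0]

for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: correlation = {centroid_correlation(X, labels):.3f}")
```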


Clustergam: visualisation of cluster analysis – Martin Fleischmann

#artificialintelligence

In this post, I introduce a new Python package to generate clustergrams from clustering solutions. The library has been developed as part of the Urban Grammar research project, and it is compatible with scikit-learn and GPU-enabled libraries such as cuML or cuDF within RAPIDS.AI. When we want to do some cluster analysis to identify groups in our data, we often use algorithms like K-Means, which require the specification of the number of clusters. But the issue is that we usually don't know how many clusters there are. There are many methods for determining the correct number, such as silhouette scores or the elbow plot, to name a few.
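A short usage sketch of the clustergram package described in the post is given below (installable via pip install clustergram); argument names may differ slightly between versions.

```python
# Fit k-means for a range of k and draw the clustergram: the paths of cluster
# means across k help judge how many clusters the data actually support.
from sklearn.datasets import load_iris
from sklearn.preprocessing import scale
from clustergram import Clustergram

X, _ = load_iris(return_X_y=True)
data = scale(X)                          # standardize before clustering

cgram = Clustergram(range(1, 9), n_init=10)
cgram.fit(data)
cgram.plot()
```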